
Conversation

@claudia-lola (Contributor)

Add tasks to eessi/configure.yml and compute-init.yml to run the EESSI link_nvidia_host_libraries.sh script on GPU nodes with NVIDIA drivers installed. The tasks run either when site.yml is run or when a rebuild via Slurm completes.

claudia-lola requested a review from a team as a code owner on October 28, 2025 15:31
claudia-lola self-assigned this on Oct 28, 2025
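
A minimal sketch of the kind of task this adds, gated on a fact recording whether an NVIDIA GPU was detected (the fact name and script path here are assumptions, not the actual change):

# Sketch only: the fact name and script path are assumptions.
- name: Link NVIDIA host libraries into EESSI
  ansible.builtin.command:
    cmd: /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
  become: true
  when: nvidia_gpu_present | default(false)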
name: basic_users
when: enable_basic_users

- name: EESSI
Collaborator

I think we should replace this whole EESSI block with running the configure task directly. But it is not obvious how to do this TBF ...

Collaborator

I've tried to describe it in https://wiki.stackhpc.com/doc/slurm-development-ZXjBRByl6K#h-only-inventory-vars

so you could change the entire EESSI thing to do this.
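
Roughly, assuming the eessi role's configure tasks can be pulled in directly (the role name and tasks file are assumptions here):

# Sketch only: include the eessi role's configure tasks instead of duplicating them.
- name: Configure EESSI
  ansible.builtin.include_role:
    name: eessi
    tasks_from: configure.yml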

cmd: "cvmfs_config setup"

# configure gpus
- name: Check for NVIDIA driver
Collaborator

I'm not sure whether there is always a /dev/nvidia0? Could you check with @jovial please for e.g. MIG and vGPU configs? Otherwise we'd have to do something like https://github.com/stackhpc/ansible-role-openhpc/blob/be6196540ca8007a0e45f2c3b2596ed0ff77fc13/library/gpu_info.py#L42, but TBH this approach here is much simpler!

Collaborator

Not in a Slurm appliance, but on an OpenStack compute host with MIG (it's hard for me to log in to the Slurm deployment with MIG, but @priteau could check there):

[stack@gpu3 ~]$ ls -lia /dev/nvidia*
4226 crw-rw-rw-. 1 root root 503,   3 Feb 19  2025 /dev/nvidia-vgpu3
3610 crw-rw-rw-. 1 root root 503,   4 Feb 11  2025 /dev/nvidia-vgpu4
3665 crw-rw-rw-. 1 root root 503,   5 Feb 11  2025 /dev/nvidia-vgpu5
2205 crw-rw-rw-. 1 root root 503,   8 Jan 23  2025 /dev/nvidia-vgpu8
2091 crw-rw-rw-. 1 root root 503,   0 Jan 23  2025 /dev/nvidia-vgpuctl
2082 crw-rw-rw-. 1 root root 195,   0 Jan 23  2025 /dev/nvidia0
2086 crw-rw-rw-. 1 root root 195,   1 Jan 23  2025 /dev/nvidia1
2080 crw-rw-rw-. 1 root root 195, 255 Jan 23  2025 /dev/nvidiactl

So checking for /dev/nvidia0 looks good to me. You will likely not have the nvidia-vgpu* device nodes in a Slurm deployment, as they only appear if you create a vGPU instance.

cmd: "cvmfs_config setup"

# configure gpus
- name: Check for NVIDIA driver
Collaborator

Strictly, this is checking for the device, which is only present if the driver is loaded. So I suggest:

Suggested change
- name: Check for NVIDIA driver
- name: Check for NVIDIA GPU

path: /dev/nvidia0
register: nvidia_driver

- name: Set fact if NVIDIA driver is present
Collaborator

Suggested change
- name: Set fact if NVIDIA driver is present
- name: Set fact if NVIDIA GPU is present
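
With both renames applied, the detection pair would read roughly like this (the fact name is an assumption):

# Sketch with the suggested renames applied; the fact name is an assumption.
- name: Check for NVIDIA GPU
  ansible.builtin.stat:
    path: /dev/nvidia0
  register: nvidia_driver

- name: Set fact if NVIDIA GPU is present
  ansible.builtin.set_fact:
    nvidia_gpu_present: "{{ nvidia_driver.stat.exists }}"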
